Dialogue history integration into end-to-end signal-to-concept spoken language understanding systems
This work investigates the embeddings for representing dialog history in
spoken language understanding (SLU) systems. We focus on the scenario when the
semantic information is extracted directly from the speech signal by means of a
single end-to-end neural network model. We propose to integrate dialogue
history into an end-to-end signal-to-concept SLU system. The dialog history is
represented in the form of dialog history embedding vectors (so-called
h-vectors) and is provided as additional information to end-to-end SLU
models in order to improve system performance. The following three types of
h-vectors are proposed and experimentally evaluated in this paper: (1)
supervised-all embeddings predicting bag-of-concepts expected in the answer of
the user from the last dialog system response; (2) supervised-freq embeddings
focusing on predicting only a selected set of semantic concepts (corresponding
to the most frequent errors in our experiments); and (3) unsupervised
embeddings. Experiments on the MEDIA corpus for the semantic slot filling task
demonstrate that the proposed h-vectors improve the model performance.
Comment: Accepted for ICASSP 2020 (Submitted: October 21, 2019
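The bag-of-concepts target mentioned for the supervised-all h-vectors can be illustrated with a minimal sketch. It assumes the target is a multi-hot vector over a fixed concept inventory. The abstract does not give the exact encoding, and the concept names and the function name below are hypothetical placeholders rather than actual MEDIA labels or the authors' code.

```python
from typing import Iterable, List

# Hypothetical concept inventory (assumption); the real MEDIA label set differs.
CONCEPTS = ["command", "date", "hotel-name", "location", "number-room"]


def bag_of_concepts(expected: Iterable[str], inventory: List[str] = CONCEPTS) -> List[float]:
    """Encode the concepts expected in the user's answer as a multi-hot vector.

    Order and repetition are discarded: each position only records whether
    the corresponding concept is expected at least once.
    """
    present = set(expected)
    unknown = present - set(inventory)
    if unknown:
        raise ValueError(f"concepts not in inventory: {sorted(unknown)}")
    return [1.0 if concept in present else 0.0 for concept in inventory]


# Training target for a system prompt whose expected answer mentions a date
# (twice) and a number of rooms.
target = bag_of_concepts(["date", "number-room", "date"])
print(target)
```

In this reading, a network trained to predict such a target from the last system response would supply one of its hidden layers as the h-vector; that step is not shown here.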
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech
Self-Supervised Learning (SSL) using huge amounts of unlabeled data has been
successfully explored for image and natural language processing. Recent works
have also investigated SSL from speech. They were notably successful in improving
performance on downstream tasks such as automatic speech recognition (ASR).
While these works suggest it is possible to reduce dependence on labeled data
for building efficient speech systems, their evaluation was mostly made on ASR
and using multiple and heterogeneous experimental settings (most of them for
English). This calls into question the objective comparison of SSL approaches and the
evaluation of their impact on building speech systems. In this paper, we
propose LeBenchmark: a reproducible framework for assessing SSL from speech. It
not only includes ASR (high and low resource) tasks but also spoken language
understanding, speech translation and emotion recognition. We also focus on
speech technologies in a language other than English: French. SSL models of
different sizes are trained from carefully sourced and documented datasets.
Experiments show that SSL is beneficial for most but not all tasks, which
confirms the need for exhaustive and reliable benchmarks to evaluate its real
impact. LeBenchmark is shared with the scientific community for reproducible
research in SSL from speech.
Comment: Will be presented at Interspeech 202
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Self-supervised learning (SSL) is at the origin of unprecedented improvements
in many different domains including computer vision and natural language
processing. Speech processing drastically benefitted from SSL as most of the
current domain-related tasks are now being approached with pre-trained models.
This work introduces LeBenchmark 2.0, an open-source framework for assessing and
building SSL-equipped French speech technologies. It includes documented,
large-scale and heterogeneous corpora with up to 14,000 hours of heterogeneous
speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to
one billion learnable parameters shared with the community, and an evaluation
protocol made of six downstream tasks to complement existing benchmarks.
LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for
speech with the investigation of frozen versus fine-tuned downstream models,
task-agnostic versus task-specific pre-trained models as well as a discussion
on the carbon footprint of large-scale model training.
Comment: Under submission at Computer Science and Language. Preprint allowe
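Intégration de sources de connaissances pour la modélisation stochastique du langage appliquée à la parole continue dans un contexte de dialogue oral homme-machine [Integration of knowledge sources for stochastic language modeling applied to continuous speech in a human-machine spoken dialogue context]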
AVIGNON-BU Centrale (840072102) / Sudoc; NANCY-INRIA Lorraine LORIA (545472304) / Sudoc; France
Stochastic Finite State Automata Language Model Triggered by Dialogue States
Within the framework of Natural Spoken Dialogue systems, this paper describes a method for dynamically adapting a Language Model (LM) to the detected dialogue state. This LM combines a standard n-gram model with Stochastic Finite State Automata (SFSAs). During training, the sentence corpus used to train the LM is split into several hierarchical clusters in a 2-step process that involves both explicit knowledge and statistical criteria. All the clusters are stored in a binary tree in which the whole corpus is attached to the root node. Each level of the tree corresponds to a higher specialization of the sub-corpora attached to the nodes, and each node corresponds to a different dialogue state. From the same sentence corpus, SFSAs are extracted in order to model longer contexts than those used in the standard n-gram model. Each node of the tree holds a set of SFSAs as well as a sub-LM that combines a bigram trained on the node's sub-corpus with the selected SFSAs. A first decoding pass computes a word graph as well as a first sentence hypothesis. This first hypothesis is used to find the optimal node in the LM tree. The word graph is then rescored using the LM attached to the selected node. By adapting the LM to the detected dialogue state, we show a statistically significant gain in WER on a dialogue corpus collected by France Telecom R&D.
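The two-pass procedure in this abstract (select a node of the LM tree from the first hypothesis, then rescore with that node's LM) can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the node is chosen by maximum likelihood of the first hypothesis (the abstract does not state the criterion), an n-best list stands in for the word graph, the SFSAs are omitted, and unseen bigrams receive a fixed floor probability instead of proper smoothing. All names and toy probabilities are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Bigram = Dict[Tuple[str, str], float]  # (previous word, word) -> probability


@dataclass
class Node:
    """One dialogue state in the LM tree, with its own bigram sub-LM."""
    name: str
    bigram: Bigram
    children: List["Node"] = field(default_factory=list)


def log_prob(sentence: List[str], bigram: Bigram, floor: float = 1e-4) -> float:
    """Log-probability of a sentence; unseen bigrams get a fixed floor (assumption)."""
    words = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(bigram.get(pair, floor)) for pair in zip(words, words[1:]))


def all_nodes(root: Node) -> List[Node]:
    """Flatten the tree, root first."""
    nodes = [root]
    for child in root.children:
        nodes.extend(all_nodes(child))
    return nodes


def select_node(root: Node, hypothesis: List[str]) -> Node:
    """Assumed criterion: the node whose sub-LM scores the first hypothesis highest."""
    return max(all_nodes(root), key=lambda n: log_prob(hypothesis, n.bigram))


def rescore(nbest: List[Tuple[List[str], float]], node: Node,
            lm_weight: float = 1.0) -> List[str]:
    """Return the hypothesis with the best acoustic score plus weighted LM score."""
    return max(nbest, key=lambda h: h[1] + lm_weight * log_prob(h[0], node.bigram))[0]


# Toy tree: a general root LM and one specialized "booking" state.
root = Node("root", {("<s>", "yes"): 0.2, ("yes", "</s>"): 0.5})
booking = Node("booking", {("<s>", "two"): 0.5, ("two", "nights"): 0.6,
                           ("nights", "</s>"): 0.7})
root.children.append(booking)

first_pass = ["two", "nights"]  # first sentence hypothesis
nbest = [(["to", "nights"], -10.0), (["two", "nights"], -10.5)]  # (words, acoustic log-score)

node = select_node(root, first_pass)
best = rescore(nbest, node)
print(node.name, best)
```

In the toy run, the specialized node is selected because its bigram covers the first hypothesis, and its LM score outweighs the slightly better acoustic score of the competing hypothesis during rescoring.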